High-quality annotation of promoter regions for 913 bacterial genomes

نویسندگان

  • Vetriselvi Rangannan
  • Manju Bansal
چکیده

MOTIVATION The number of bacterial genomes being sequenced is increasing very rapidly and hence, it is crucial to have procedures for rapid and reliable annotation of their functional elements such as promoter regions, which control the expression of each gene or each transcription unit of the genome. The present work addresses this requirement and presents a generic method applicable across organisms. RESULTS Relative stability of the DNA double helical sequences has been used to discriminate promoter regions from non-promoter regions. Based on the difference in stability between neighboring regions, an algorithm has been implemented to predict promoter regions on a large scale over 913 microbial genome sequences. The average free energy values for the promoter regions as well as their downstream regions are found to differ, depending on their GC content. Threshold values to identify promoter regions have been derived using sequences flanking a subset of translation start sites from all microbial genomes and then used to predict promoters over the complete genome sequences. An average recall value of 72% (which indicates the percentage of protein and RNA coding genes with predicted promoter regions assigned to them) and precision of 56% is achieved over the 913 microbial genome dataset. AVAILABILITY The binary executable for 'PromPredict' algorithm (implemented in PERL and supported on Linux and MS Windows) and the predicted promoter data for all 913 microbial genomes are available at http://nucleix.mbu.iisc.ernet.in/prombase/.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Generic eukaryotic core promoter prediction using structural features of DNA.

Despite many recent efforts, in silico identification of promoter regions is still in its infancy. However, the accurate identification and delineation of promoter regions is important for several reasons, such as improving genome annotation and devising experiments to study and understand transcriptional regulation. Current methods to identify the core region of promoters require large amounts...

متن کامل

Deppdb - DNA electrostatic Potential Properties Database: electrostatic Properties of genome DNA Elements

The electrostatic properties of genome DNA influence its interactions with different proteins, in particular, the regulation of transcription by RNA-polymerases. DEPPDB--DNA Electrostatic Potential Properties Database--was developed to hold and provide all available information on the electrostatic properties of genome DNA combined with its sequence and annotation of biological and structural p...

متن کامل

Annotation confidence score for genome annotation: a genome comparison approach

MOTIVATION The massively parallel sequencing technology can be used by small research labs to generate genome sequences of their research interest. However, annotation of genomes still relies on the manual process, which becomes a serious bottleneck to the high-throughput genome projects. Recently, automatic annotation methods are increasingly more accurate, but there are several issues. One im...

متن کامل

Next-Generation Annotation of Prokaryotic Genomes with EuGene-P: Application to Sinorhizobium meliloti 2011

The availability of next-generation sequences of transcripts from prokaryotic organisms offers the opportunity to design a new generation of automated genome annotation tools not yet available for prokaryotes. In this work, we designed EuGene-P, the first integrative prokaryotic gene finder tool which combines a variety of high-throughput data, including oriented RNA-Seq data, directly into the...

متن کامل

MIPS bacterial genomes functional annotation benchmark dataset

MOTIVATION Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. RE...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Bioinformatics

دوره 26 24  شماره 

صفحات  -

تاریخ انتشار 2010